For this problem set we will apply some of the approaches presented in ISLR for variable selection and model regularization to datasets that we have worked with previously. The goal will be to see whether more principled methods for model selection will allow us to better understand relative variable importance, variability of predictive performance of the models, etc.
In the preface we will use algae dataset that we used in the
lectures, to illustrate some of the concepts and approaches utilized
here. Just to do something different for a change, this time we will be
modeling another outcome available available in the algae dataset,
AG2. The problems that you are asked to explore on your own
will use the fundraising dataset from the previous problem sets, but
will be otherwise similar to the examples shown here. The flow of the
presentation in this Preface also closely follows the outline of the
Labs 6.5 and 6.6 in ISLR and you are encouraged to refer to them for
additional examples and details.
algaeRaw <- read.table ("coil.analysis.data.txt", header=F, sep =",", row.names =NULL, na.strings ="XXXXXXX")
colnames (algaeRaw)= c("season","size","velocity",paste0("C",1:8),paste0("AG",1:7))
algaeRaw[1:3,]
## season size velocity C1 C2 C3 C4 C5 C6 C7 C8 AG1
## 1 winter small_ medium 8.00 9.8 60.80 6.238 578.000 105.000 170.000 50.0 0.0
## 2 spring small_ medium 8.35 8.0 57.75 1.288 370.000 428.750 558.750 1.3 1.4
## 3 autumn small_ medium 8.10 11.4 40.02 5.330 346.667 125.667 187.057 15.6 3.3
## AG2 AG3 AG4 AG5 AG6 AG7
## 1 0.0 0.0 0.0 34.2 8.3 0.0
## 2 7.6 4.8 1.9 6.7 0.0 2.1
## 3 53.6 1.9 0.0 0.0 0.0 9.7
# remove cases with undefined values and three outliers:
algae <- na.omit(algaeRaw)
algae <- algae[algae[,"C4"]<max(algae[,"C4"],na.rm=TRUE)&algae[,"C3"]<max(algae[,"C3"],na.rm=TRUE)&algae[,"AG4"]<max(algae[,"AG4"],na.rm=TRUE),]
# log-transform selected attributes:
for ( iCol in 1:8 ) {
if ( iCol > 2 ) {
algae[,paste0("C",iCol)] <- log(algae[,paste0("C",iCol)])
}
if ( iCol < 8 ) {
algae[,paste0("AG",iCol)] <- log(1+algae[,paste0("AG",iCol)])
}
}
# we'll use AG2 as an outcome here:
algaeAG2 <- algae[,!colnames(algae)%in%paste0("AG",c(1,3:7))]
pairs(algaeAG2[,-(1:3)])
Assuming that we have read and pre-processed algae data (omitted
observations with undefined values, log-transformed where necessary and
removed egregious outliers), let’s use regsubsets from
library leaps to select optimal models with the number of
terms ranging from one to all variables in the dataset. We will try each
of the methods available for this function and collect corresponding
model metrics (please notice that we override default value of
nvmax argument; reflect on why we do is and why we use that
specific value – remember that the goal here is to evaluate models up to
and including those that include every predictor
available):
summaryMetrics <- NULL
whichAll <- list()
# for each of the four methods we want to try:
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
# (OK, here's your answer: 15 because three categorical attributes are
# represented by dummy variables) -- run regsubsets for the method myMthd:
rsRes <- regsubsets(AG2~.,algaeAG2,method=myMthd,nvmax=15)
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which # get the variable choices for the current method
# and also extract from the summary all the metrics we are interested in:
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(
method=myMthd,
metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]])
)
}
}
#... and now plot everything:
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) +
geom_path() +
geom_point() +
facet_wrap(~metric,scales="free") +
theme(legend.position="top")+
theme_bw()
We can see that, (1) a couple of times sequential replacement selects
models far more inferior to those selected by the rest of the methods;
(2) the “backward” method comes up with models ever so slightly worse
than the rest of the methods for variable numbers
nvars=5:6. Otherwise, we generally come up with very
comparable model performance by every associated metric.
Let us check which variables were actually selected by those
different methods for each model size. We can quickly assess those
selections by directly plotting the “yes/no” variable usage information
encoded by which matrices (which we extracted in the above
code from the model summaries and saved). The plots demonstrate that the
first four variables are always the same and are selected in the same
order as the model size increases, regardless of the method used: the
variable that gets selected first is C8, followed by
C1, then C3, and then
medium size. For larger model sizes, the order in which
different methods select additional variables beyond those four starts
varying:
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
# for each method used:
for ( myMthd in names(whichAll) ) {
# get the previously saved 'which' matrtix and justy plot it using the 'mage' cmd:
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],
xlab="N(vars)", ylab="",
xaxt="n", yaxt="n", # suppress axis plotting, we will draw axes manually in a moment
breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
# use variable names as axis tick labels:
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
As we have already discussed, it is not generally a great idea to select variables on the whole dataset. Choosing “the right variables” should rather be considered an integral part of the process of model fitting itself: which variables we choose is as much “fitting the data we have in hand” as finding the best model coefficients for any given set of the variables. Thus, we will follow the Lab 6.5.3 in ISLR and split our data approximately evenly into training and test, multiple times. Every time we will be selecting the best subsets of variables (for different model sizes) on the training data only, and then evaluating the performance of those models on the training and test sets. Since with different samples of training data we can, conceivably be selecting different variables (which is the whole idea!), we will also need to record which variables have been selected each time.
In order to be able to use regsubsets output to make
predictions, we follow ISLR and setup an appropriate
predict function that can be directly applied to the
objects returned by regsubsets (notice
.regsubsets in its name – this is how under S3 OOP
framework in R methods are matched to corresponding classes; in the code
that follows we will be just passing objects of class
regsubsets the “generic” predict function, and
the latter will know to dispatch the call to the specialized
implementation predict.rebsubsets() by simply matching the
class name):
# takes `regsubsets` object (as returned by function `regsubsets()`), and also
# the model "id" - since regsubsets stores the best selections for *each*
# model size the function was run for! Applies selected best model to the newdata
# and returns the predictions
predict.regsubsets <- function (object, newdata, id, ...){
form=as.formula(object$call [[2]]) # get the formula used for model selection
# convert data into model matrix (will take care of 1-hot encoding as needed)
mat=model.matrix(form,newdata)
coefi=coef(object,id=id) # get the fitted coefficients of the (best) model of the given size
xvars=names (coefi) # figure out which variables are actually used by that model
mat[,xvars] %*% coefi # extract just those variables from the data and compute predictions
}
Now we are all set to repeatedly draw training sets, choose the best
set of variables on them by each of the four different methods available
in regsubsets, calculate test error on the remaining
samples, etc. To summarize variable selection over multiple splits of
the data into training and test, we will use a 3-dimensional
array whichSum – first two dimensions match the
which matrix (i.e. the number of variables used, which is
15 in this case as we discussed above), and the third dimension
corresponds to the four methods available in
regsubsets.
(if you find it difficult thinking in terms of 3D-arrays, you can
change the code to make whichSum a list of four
15x16 matrices, each corresponding to specific variable selection
method).
In order to split the data into training and test we will again use
sample() function – those who are curious and are paying
attention may want to reflect on the difference in how it is done below
and how it is implemented in the Ch. 6.5.3 of ISLR and think about the
consequences. (Hint: consider how size of training or test datasets will
vary from one iteration to another in these two implementations)
dfTmp <- NULL
whichSum <- array(0,dim=c(15,16,4),
dimnames=list(NULL,colnames(model.matrix(AG2~.,algaeAG2)),
c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(algaeAG2)))
# Try each method available in regsubsets
# to select the best model of each size, using training data only:
for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
rsTrain <- regsubsets(AG2~.,algaeAG2[bTrain,],nvmax=15,method=jSelect)
# Add up variable selections: each cell of the resulting matrix will literally
# count *how many times* each particular variable was selected for each
# particular model size across all the random resamplings of the data:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables (each model size)
# using predict.regsubsets implemented above:
for ( kVarSet in 1:15 ) {
# make predictions:
testPred <- predict(rsTrain,algaeAG2[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-algaeAG2[!bTrain,"AG2"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()
We can see clear difference in the behavior of training and test error with the increase in the number of attributes added to the model. The training error gradually decreases (although for larger numbers of predictors included in the model, the difference between median errors is small comparing to the variability of the error across multiple splits of the data into training and test). Test error shows clear decrease upon addition of the second attribute to the model followed by steady increase with the inclusion of three or more attributes in the model. Therefore we can conclude that models with three or more attributes are very likely to overfit in this case.
Below we plot the fraction of times each variable was included in the best model of every given size, by each of the four methods (darker shades of gray indicate the fraction closer to 1). We can see from the plots that the selection of C1 and C8 as the first two best variables is quite consistent (although the order may be different, depending on the training data (!!); and on some rare occasions – i.e. very light shades of gray, – even some other variables may be selected as the best ones for \(n=1\) or \(n=2\)-variable model). Past the model sizes 3-4, it looks like pretty much any variable can be picked as the next-best one depending on particular train-test split, with comparable frequencies. This is consistent with the observation that models beyond that size are overfitting, so they are indeed too eager to pick any variable that happens to fit the noise (!) better in any particular data split.
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(whichSum)[[3]] ) {
tmpWhich <- whichSum[,,myMthd] / nTries
image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
col=c("white","gray90","gray75","gray50","gray25","gray10"))
axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}
par(old.par)
Similar observations can be made using cross-validation rather than the split of the dataset into training and test that is omitted here for the purposes of brevity.
As explained in the lecture and ISLR Ch.6.6 lasso and ridge
regression can be performed by glmnet function from library
glmnet – its argument alpha governs the form
of the shrinkage penalty, so that alpha=0 corresponds to
ridge and alpha=1 – to lasso regression. The arguments to
glmnet differ from those used for lm for
example and require specification of the matrix of predictors and
outcome separately. model.matrix is particularly helpful
for specifying matrix of predictors by creating dummy variables for
categorical predictors:
# -1 to get rid of intercept that glmnet knows to include:
x <- model.matrix(AG2~.,algaeAG2)[,-1]
head(algaeAG2)
## season size velocity C1 C2 C3 C4 C5 C6
## 1 winter small_ medium 8.00 9.8 4.107590 1.8306596 6.359574 4.653960
## 2 spring small_ medium 8.35 8.0 4.056123 0.2530906 5.913503 6.060874
## 3 autumn small_ medium 8.10 11.4 3.689379 1.6733512 5.848365 4.833636
## 4 spring small_ medium 8.07 4.8 4.348522 0.8337783 4.586823 4.113853
## 5 autumn small_ medium 8.06 9.0 4.013677 2.3433431 5.454038 4.064263
## 6 winter small_ high__ 8.25 13.1 4.185860 2.2244073 6.063785 2.904165
## C7 C8 AG2
## 1 5.135798 3.9120230 0.000000
## 2 6.325702 0.2623643 2.151762
## 3 5.231413 2.7472709 4.000034
## 4 4.932313 0.3364722 3.737670
## 5 4.580673 2.3513753 1.360977
## 6 4.037192 3.3463891 2.747271
# notice how it created dummy variables for categorical attributes
head(x)
## seasonspring seasonsummer seasonwinter sizemedium sizesmall_ velocitylow___
## 1 0 0 1 0 1 0
## 2 1 0 0 0 1 0
## 3 0 0 0 0 1 0
## 4 1 0 0 0 1 0
## 5 0 0 0 0 1 0
## 6 0 0 1 0 1 0
## velocitymedium C1 C2 C3 C4 C5 C6 C7
## 1 1 8.00 9.8 4.107590 1.8306596 6.359574 4.653960 5.135798
## 2 1 8.35 8.0 4.056123 0.2530906 5.913503 6.060874 6.325702
## 3 1 8.10 11.4 3.689379 1.6733512 5.848365 4.833636 5.231413
## 4 1 8.07 4.8 4.348522 0.8337783 4.586823 4.113853 4.932313
## 5 1 8.06 9.0 4.013677 2.3433431 5.454038 4.064263 4.580673
## 6 0 8.25 13.1 4.185860 2.2244073 6.063785 2.904165 4.037192
## C8
## 1 3.9120230
## 2 0.2623643
## 3 2.7472709
## 4 0.3364722
## 5 2.3513753
## 6 3.3463891
y <- algaeAG2[,"AG2"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
Plotting output of glmnet illustrates change in the
contributions of each of the predictors as amount of shrinkage changes.
In ridge regression each predictor contributes more or less over the
entire range of shrinkage levels.
Output of cv.glmnet shows averages and variabilities of
MSE in cross-validation across different levels of regularization.
lambda.min field indicates values of \(\lambda\) at which the lowest average MSE
has been achieved, lambda.1se shows larger \(\lambda\) (more regularization) that has
MSE 1SD (of cross-validation) higher than the minimum – this is an often
recommended \(\lambda\) to use under
the idea that it will be less susceptible to overfit. You may find it
instructive to experiment by providing different levels of lambda other
than those used by default to understand sensitivity of
gv.glmnet output to them. predict depending on
the value of type argument allows to access model
predictions, coefficients, etc. at a given level of lambda:
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.8415975
cvRidgeRes$lambda.1se
## [1] 3.7288
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.4201892833
## seasonspring 0.0037729438
## seasonsummer -0.0060035842
## seasonwinter -0.0975006229
## sizemedium -0.1480074660
## sizesmall_ -0.0593286200
## velocitylow___ 0.1251026812
## velocitymedium 0.1729267672
## C1 0.4956087322
## C2 -0.0005789763
## C3 0.1063279517
## C4 -0.0087399612
## C5 -0.0163725209
## C6 0.0739619299
## C7 0.0308752689
## C8 0.1288509916
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.134278752
## seasonspring 0.008454967
## seasonsummer -0.012326674
## seasonwinter -0.035164346
## sizemedium -0.048241805
## sizesmall_ -0.061382041
## velocitylow___ 0.091905902
## velocitymedium 0.086372127
## C1 0.242471637
## C2 -0.006082201
## C3 0.060972178
## C4 0.001838859
## C5 0.003395528
## C6 0.047061060
## C7 0.039129964
## C8 0.068218969
# and with lambda's other than default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)
Similarly to what was observed for variable selection methods above,
plot of cross-validation error for ridge regression has well-defined
minimum indicating that some amount of regularization is necessary for
the model using all attributes to prevent overfitting. Notice that
minimum \(MSE\simeq 1.25\) from ridge
regression here is very comparable to the minimum observed above for
average test error when variables were selected by
regsubsets.
Relatively higher contributions of C1 and C8 to the model outcomed are more apparent for the results of ridge regression performed on centered and, more importantly, scaled matrix of predictors:
ridgeResScaled <- glmnet(scale(x),y,alpha=0)
plot(ridgeResScaled)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0)
plot(cvRidgeResScaled)
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.453178693
## seasonspring 0.003691324
## seasonsummer -0.005260707
## seasonwinter -0.016378543
## sizemedium -0.024054178
## sizesmall_ -0.028723168
## velocitylow___ 0.034721169
## velocitymedium 0.042665971
## C1 0.114287215
## C2 -0.014465684
## C3 0.068388149
## C4 0.001705746
## C5 0.004647950
## C6 0.059777100
## C7 0.044104199
## C8 0.097283518
Lasso regression is done by the same call to glmnet
except that now alpha=1. One can see now how more
coefficients become zeroes with increasing amount of shrinkage. Notice
that amount of regularization increases from right to left when plotting
output of glmnet and from left to right when plotting
output of cv.glmnet.
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)
# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)
Also well-defined minimum of cross-validation MSE for lasso regularization.
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.58075850
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium .
## sizesmall_ .
## velocitylow___ .
## velocitymedium .
## C1 0.45361507
## C2 .
## C3 0.04081512
## C4 .
## C5 .
## C6 .
## C7 .
## C8 0.13480989
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.89634959
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium -0.02207742
## sizesmall_ .
## velocitylow___ .
## velocitymedium 0.07118428
## C1 0.68743275
## C2 .
## C3 0.09713636
## C4 .
## C5 .
## C6 0.04660259
## C7 .
## C8 0.16402891
As explained above and illustrated in the plots for the output of
cv.glmnet, lambda.1se typically corresponds to
more shrinkage with more coefficients set to zero by lasso. Use of
scaled predictors matrix makes for more apparent contributions of C1 and
C8, and to smaller degree, C3:
lassoResScaled <- glmnet(scale(x),y,alpha=1)
plot(lassoResScaled)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
plot(cvLassoResScaled)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.45317869
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium .
## sizesmall_ .
## velocitylow___ .
## velocitymedium .
## C1 0.20081746
## C2 .
## C3 0.03376726
## C4 .
## C5 .
## C6 .
## C7 .
## C8 0.18722644
Lastly, we can run lasso on several training datasets and calculate corresponding test MSE and frequency of inclusion of each of the coefficients in the model:
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 1.346731
lassoCoefCnt
## seasonspring seasonsummer seasonwinter sizemedium sizesmall_
## 0 0 0 0 0
## velocitylow___ velocitymedium C1 C2 C3
## 1 1 28 0 7
## C4 C5 C6 C7 C8
## 0 0 9 0 25
One can conclude that typical lasso model includes two, sometimes three, coefficients and (by comparison with some of the plots above) that its test MSE is about what was observed for two to three variable models as chosen by the best subset selection approach.
We will use the same fundraising dataset from the week 4 problem set (properly preprocessed: shifted/log-transformed, original predictions provided alongside the data being excluded).
contrib by all the methods available in
regsubsets.It is up to you whether you want to include gender
attribute in your analysis. It is a categorical attribute and as such it
has to be correctly represented by dummy variable(s).
If you do it properly (and the preface supplies examples of doing
that), you will be getting three extra points for each problem in this
week assignment that (correctly!) includes gender variable
into the analysis, for the possible total extra of 3x4=12
points.
If you prefer not to add this extra work, then excluding
gender from the data at the outset (as you were instructed
to do for week 4) is probably the cleanest way to prevent it from
getting in the way of your computations; in that case the total points
you can earn for this assignment reverts to the standard 60 points.
#process data
fundata = read.csv("fund-raising-with-predictions.csv")
fundata$gender = as.numeric(fundata$gender == "F") #female = 1; male = 0
data = subset(fundata, select = c(-predcontr))
data[,-13] = log(data[,-13] + 1)
#subset selection
summetrics = NULL
whichAll = list()
for (method in c("exhaustive", "backward", "forward", "seqrep")){
rsresults = regsubsets(contrib ~ ., data = data, method = method, nvmax = 12)
sumresults = summary(rsresults)
whichAll[[method]] = sumresults$which
for (metric in c("rsq", "rss", "adjr2", "cp", "bic"))
{
summetrics = rbind(summetrics, data.frame(
method = method,
metric = metric,
nvars = 1:length(sumresults[[metric]]),
value = sumresults[[metric]])
)
}
}
#plot metrics
ggplot(summetrics, aes(x = nvars, y = value, shape = method, color = method)) +
geom_path() +
geom_point() +
facet_wrap(~metric, scales = "free") +
theme(legend.position = "top") +
theme_bw() +
scale_x_continuous(breaks = seq(0, 12, by = 2))
#which variables
par(mfrow = c(2,2), ps = 14, mar = c(5, 7, 2, 1))
for (method in names(whichAll) ) {
image(1:nrow(whichAll[[method]]),
1:ncol(whichAll[[method]]),
whichAll[[method]],
xlab="N(vars)",
ylab="",
xaxt="n",
yaxt="n",
breaks=c(-0.5,0.5,1.5),
col=c("white","pink"),
main=method)
axis(1,1:nrow(whichAll[[method]]),rownames(whichAll[[method]]))
axis(2,1:ncol(whichAll[[method]]),colnames(whichAll[[method]]), las = 2)
}
The plots show that each of the methods are generally in agreement and show 7 variables appear to be optimal, with the highest AdjR2 and R2, and lowest BIC, Cp, and RSS. At 4 variables, the backward method shows a slightly worse mode than the other methods. At 7 and 9 variables, sequential replacement selects inferior models compared to the other methods.
All of the methods include avecontr first, followed by lastcontr and maxdate. After that, all of the methods include mincontrib, with the exeption of the backward method, which chose mindate. Mindate is the fifth variable for all of the other methods; however, the fifth variable for the backward method is maxcontrib. For 6 and 7 variables, the forward method selected ncontrib and maxcontrib, respectively. The backward method selected ncontrib and mincontrib. Seqrep showed both maxcontrib and ncontrib for 6, and both promocontr and gapmos for 7, which were both ranked lower in the other methods. Gender and age were included 11th and 12th in all of the methods.
regsubsets.which field of the object returned by
regsubsets, investigate stability of variable selection at
each model size across multiple selections of training/test data (again,
follow the example from the Preface).predcontr attribute)?#split testing / training
predict.regsubsets = function (object, newdata, id, ...){
form = as.formula(object$call[[2]])
mat = model.matrix(form,newdata)
coefi = coef(object, id=id)
xvars=names(coefi)
mat[,xvars] %*% coefi
}
dfTmp = NULL
whichSum = array(0,dim=c(12,13,4),
dimnames = list(NULL,colnames(model.matrix(contrib ~.,data)),
c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries = 30
for (iTry in 1:nTries ) {
bTrain = sample(rep(c(TRUE,FALSE),length.out=nrow(data)))
for (jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
rsTrain = regsubsets(contrib ~ ., data[bTrain,], nvmax=12, method=jSelect)
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
#calculate test error for each set of variables (each model size)
for (kVarSet in 1:12 ) {
testPred = predict(rsTrain, data[!bTrain,],id=kVarSet)
mseTest = mean((testPred-data[!bTrain,"contrib"])^2)
dfTmp = rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
ggplot(dfTmp,aes(x=factor(vars),y=mse,color=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()
predcontrmse = mean((data$contrib - log(fundata$predcontr + 1)) ^ 2)
paste("predcontr MSE:", round(predcontrmse, 2))
## [1] "predcontr MSE: 0.17"
With 7 variables, MSE appears to be the lowest for both the testing and training data, at below 0.12 median with slightly lower training error. The error rate remains stable with the addition of more variables, suggesting that the additional variables will not add much value. The different methods are largely in agreement with eachother; however, seqrep shows higher MSE at some model sizes compared to the other methods. This error is lower than the MSE of the predcontr predictions, which was approximately 0.17.
glmnet and cv.glmnet
results. Compare the values of the model coefficients at the value of
\(\lambda\) corresponding to the
minimum of cross-validated test MSE and to the test MSE which is 1SE
away from the minimum. Which model coefficients are set to zero?lambda passed to
cv.glmnet and discuss the results.#lasso regression model
x = model.matrix(contrib ~ ., data)[,-1]
y = data[,"contrib"]
lass = glmnet(x, y, alpha = 1)
cvlass = cv.glmnet(x, y, alpha = 1)
par(mfrow = c(1,2))
plot(lass); plot(cvlass)
cvlass$lambda.min; cvlass$lambda.1se
## [1] 0.001625733
## [1] 0.03502535
predict(lass, type = "coefficients", s = cvlass$lambda.min)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.7338310744
## gapmos 0.0023251184
## promocontr -0.0051650684
## mincontrib -0.0506616350
## ncontrib -0.0522508602
## maxcontrib 0.1228826059
## lastcontr 0.3405551248
## avecontr 0.5033110372
## mailord -0.0030292650
## mindate -1.0866780143
## maxdate 1.6459934869
## age .
## gender -0.0001368809
predict(lass, type = "coefficients", s = cvlass$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.1837284392
## gapmos .
## promocontr .
## mincontrib .
## ncontrib -0.0005258808
## maxcontrib 0.0980294430
## lastcontr 0.3600084275
## avecontr 0.4300831123
## mailord .
## mindate .
## maxdate 0.4990649734
## age .
## gender .
cvLassRes = cv.glmnet(x,y,alpha=1,lambda=10^((-80:80)/20))
plot(cvLassRes)
cvLassRes2 = cv.glmnet(x, y, alpha = 1, lambda = 10^((-120:0)/20))
plot(cvLassRes2)
The model coefficients using 1SE are lower than some of the coefficients using lambda.min. For example avecontr (1SE) is 0.43, wheras avecontr (min) is 0.49. However, this trend is not consistent across all of the variables. Fewer coefficients are included with 1SE than with min. 1SE includes ncontrib, maxcontrib, lastcontr, avecontr, and maxdate. Min includes all of these plus mailord and mindate. The range of log(lambda) below approximately -4 or -5 resulted in the low plateau of MSE, to the level shown above. Log(lambda) above approximately 0 resulted in a high MSE plateau around 0.33.
predcontr variable) and to the
errors of the models you fitted earlier – discuss the results.lassoCoefCnt = 0
lassoMSE = NULL
for (iTry in 1:30 ) {
bTrain = sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvLassoTrain = cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrain = glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrainCoef = predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
lassoCoefCnt = lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
lassoTestPred = predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
lassoMSE = c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.1277493
lassoCoefCnt
## gapmos promocontr mincontrib ncontrib maxcontrib lastcontr avecontr
## 0 0 0 0 28 30 30
## mailord mindate maxdate age gender
## 0 0 6 0 0
Through resampling, the typical model size includes at least 3 variables, but no more than 6, which contrasts with the previous best subset selection which showed the lowest error with 7 variables. The mean MSE is 0.126, which is slightly higher than the minimum MSE seen with the best subset selection above.